
[benchmark] add option to enable CompiledAutograd #1536

Draft · crcrpar wants to merge 2 commits into main from crpa/litgptbench_compiledautograd
Conversation

@crcrpar (Collaborator) commented on Dec 10, 2024

What does this PR do?

CompiledAutograd seems to speed up FSDP2, which I verified with torchtitan.
However, I somehow do not find it beneficial for litgpt models.
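
For reference, a minimal sketch of how CompiledAutograd is typically enabled around the backward pass (this uses the `torch._dynamo.compiled_autograd.enable` context-manager API from recent PyTorch; the model, optimizer, and backend here are placeholders, not this benchmark's actual option):

```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

model = torch.nn.Linear(1024, 1024)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def compiler_fn(gm):
    # compiler_fn receives the captured backward graph (an fx.GraphModule)
    # and returns a compiled callable for it.
    return torch.compile(gm, backend="inductor")

x = torch.randn(32, 1024)
loss = model(x).sum()

# Capture the autograd graph for this backward call and compile it.
with compiled_autograd.enable(compiler_fn):
    loss.backward()
opt.step()
```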

Setting: pjnl-20241209, 8x H100

torchtitan Llama-3-8B

This run uses activation checkpointing, since the provided config enables it by default -- https://github.com/pytorch/torchtitan/blob/05a8b5e4c1de979c4b49ff36e6b09d6055db29b1/train_configs/llama3_8b.toml#L53-L55

| CompiledAutograd | Performance (tps) | Memory (GB) |
|---|---|---|
| N | 6244 | 51.2 |
| Y | 7200 | 43.0 |

litgpt llama-2-7b-hf

| CompiledAutograd | Performance (tokens/s/GPU) | Memory (GB) |
|---|---|---|
| N | 11722.76 | 39.13 |
| Y | 10702.33 | 52.61 |

@crcrpar crcrpar force-pushed the crpa/litgptbench_compiledautograd branch from fa70057 to b772a75 on January 6, 2025 at 05:50
@IvanYashchuk (Collaborator)

Have you investigated further why this option helps the torchtitan model code and why there are no improvements here?

@crcrpar (Collaborator, Author) commented on Feb 3, 2025

No, I haven't had enough bandwidth for it.

@riccardofelluga (Collaborator)

> @IvanYashchuk Have you investigated further why this option helps the torchtitan model code and why there are no improvements here?

It looks like torchtitan uses compiled_autograd only for DDP and not FSDP. In the PR where they added the compiled_autograd option, they enabled it only for DDP. From the comments, they seem confident that compiled autograd brings an advantage for DDP, but they don't seem to be as optimistic about FSDP.

Here is where the DDP compiled_autograd option is implemented: https://github.com/pytorch/torchtitan/blob/49c6d6fc15ef644e5c3b1003ad4e0d9ea5fcb9a9/torchtitan/parallelisms/parallelize_llama.py#L105-L110
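
A rough illustration of that kind of gating (a paraphrase of the pattern, not the actual torchtitan code; `apply_ddp` and `enable_compiled_autograd` are placeholder names, and it assumes a recent PyTorch where the `torch._dynamo.config.compiled_autograd` and `optimize_ddp` knobs exist and `torch.distributed` has already been initialized):

```python
import torch

def apply_ddp(model: torch.nn.Module, enable_compiled_autograd: bool) -> torch.nn.Module:
    # Compiled autograd is only turned on for the DDP path; the FSDP path
    # is left untouched.
    if enable_compiled_autograd:
        # Route DDP gradient reduction through the python reducer so the
        # backward pass can be captured by compiled autograd.
        torch._dynamo.config.optimize_ddp = "python_reducer"
        torch._dynamo.config.compiled_autograd = True
    return torch.nn.parallel.DistributedDataParallel(model)
```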

@IvanYashchuk (Collaborator)

Does the all-reduce operation appear in the FX graph with DDP + compiled_autograd?
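
One way to check this (a sketch, not something verified in this PR): pass a compiler_fn that dumps the captured backward graph and look for collective nodes such as all_reduce. The DDP setup itself is assumed to exist elsewhere:

```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

def inspecting_compiler_fn(gm):
    # Print the captured backward FX graph; if the gradient all_reduce calls
    # were captured into the graph, they should show up as nodes here.
    gm.graph.print_tabular()
    return torch.compile(gm, backend="eager")

# loss = ddp_model(batch).sum()   # assuming a DDP-wrapped model and input batch
# with compiled_autograd.enable(inspecting_compiler_fn):
#     loss.backward()
```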
